
    3PO: Programmed Far-Memory Prefetching for Oblivious Applications

    Using memory located on remote machines, or far memory, as a swap space is a promising approach for meeting the increasing memory demands of modern datacenter applications. Operating systems have long relied on prefetchers to mask the increased latency of fetching pages from swap space into main memory. Unfortunately, with traditional prefetching heuristics, performance still degrades when applications use far memory. In this paper we propose a new prefetching technique for far-memory applications. We focus on memory-intensive, oblivious applications whose memory access patterns are independent of their inputs, such as matrix multiplication. For this class of applications we observe that we can prefetch pages perfectly, without relying on heuristics. However, prefetching perfectly without requiring significant application modifications is challenging. We describe the design and implementation of 3PO, a system that provides pre-planned prefetching for general oblivious applications. We demonstrate that 3PO can accelerate applications, e.g., running them 30-150% faster than Linux's prefetcher does with 20% local memory. We also use 3PO to understand the fundamental software overheads of prefetching in a paging-based system, and the minimum performance penalty they impose when applications run under constrained local memory.
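
    3PO itself is not reproduced here, but a minimal sketch of the underlying idea, assuming a simple sequential traversal and Linux's madvise(MADV_WILLNEED) hint, shows how an input-independent access pattern lets prefetching be planned rather than guessed. The lookahead constant and function names below are illustrative, not taken from the paper.

```c
/* Sketch of pre-planned prefetching for an oblivious access pattern:
 * because the traversal order is known in advance, pages can be
 * requested a fixed distance ahead with madvise(MADV_WILLNEED).
 * LOOKAHEAD_PAGES and the helper are illustrative, not 3PO's code. */
#define _DEFAULT_SOURCE
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

#define LOOKAHEAD_PAGES 8   /* how far ahead of the access stream to fetch */

static void prefetch_ahead(char *base, size_t cur_off, size_t len, size_t page)
{
    size_t pf = cur_off + LOOKAHEAD_PAGES * page;
    if (pf < len)
        madvise(base + (pf & ~(page - 1)), page, MADV_WILLNEED);
}

/* Sequentially sums a large, possibly swapped-out array: the access
 * pattern is input-independent, so prefetching can be exact. */
double sum_all(double *m, size_t nelems)
{
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    char *base  = (char *)m;
    size_t len  = nelems * sizeof(double);
    double s = 0.0;

    for (size_t i = 0; i < nelems; i++) {
        size_t off = i * sizeof(double);
        if (off % page == 0)            /* once per page boundary */
            prefetch_ahead(base, off, len, page);
        s += m[i];
    }
    return s;
}
```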

    Metronome: adaptive and precise intermittent packet retrieval in DPDK

    DPDK (Data Plane Development Kit) is arguably the most widely used framework for software packet processing today. Its impressive performance, however, comes at the cost of precious CPU resources, which are dedicated to continuously polling the NICs. To address this issue, this paper presents Metronome, an approach devised to replace continuous DPDK polling with an intermittent sleep&wake mode. Metronome revolves around two main innovations. First, we design a microsecond-scale sleep function, named hr_sleep(), which outperforms Linux's nanosleep() by more than an order of magnitude in precision when running threads with common time-sharing priorities. Second, we design, model, and assess an efficient multi-thread operation that guarantees service continuity and improved robustness against preemptive thread executions, as in common CPU-sharing scenarios, while providing controlled latency and high polling efficiency by dynamically adapting to the measured traffic load.
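
    The abstract does not detail hr_sleep()'s internals. The sketch below shows one common way to build a microsecond-precision sleep in userspace: a coarse clock_nanosleep() to an absolute deadline minus a small margin, then a short busy-wait for the remainder. The spin margin and the function body are assumptions, not Metronome's actual implementation.

```c
/* Two-phase high-resolution sleep: let the kernel sleep us until just
 * before an absolute deadline, then spin out the final microseconds.
 * A common pattern, not Metronome's hr_sleep(); SPIN_MARGIN_NS is an
 * illustrative constant. */
#define _GNU_SOURCE
#include <stdint.h>
#include <time.h>

#define SPIN_MARGIN_NS 50000  /* busy-wait for the last 50 us */

static uint64_t now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

void hr_sleep_ns(uint64_t duration_ns)
{
    uint64_t deadline = now_ns() + duration_ns;

    /* Coarse phase: kernel sleep until near the deadline. */
    if (duration_ns > SPIN_MARGIN_NS) {
        uint64_t coarse = deadline - SPIN_MARGIN_NS;
        struct timespec until = {
            .tv_sec  = (time_t)(coarse / 1000000000ull),
            .tv_nsec = (long)(coarse % 1000000000ull),
        };
        clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME, &until, NULL);
    }

    /* Fine phase: spin until the deadline passes. */
    while (now_ns() < deadline)
        ;
}
```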

    Shape-Based Segmentation of Objects in Cities using LIDAR Data

    This paper presents a system for recognizing and segmenting objects in 3D point clouds of urban environments within a probabilistic framework. Our system learns a probabilistic model of object shape from manually labeled training data. We then use this model and a boosting classifier to learn the relationships between recognition hypotheses (object locations) and segmentation hypotheses (the data points that belong to that object class). Finally, we conduct approximate inference to find the most likely labeling of test data by iteratively updating recognition and segmentation hypotheses until convergence. We evaluate the performance of our algorithm on the car and tree object classes using a ground-truthed LIDAR dataset obtained in New York City, which contains approximately 900 cars and 700 trees.
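
    As an illustration of the alternating inference loop the abstract describes, the skeleton below swaps the paper's learned shape model and boosting classifier for a simple nearest-center rule, keeping only the control flow: update segmentation labels, re-estimate object hypotheses, repeat until no label changes. Everything here is a simplified stand-in for the paper's model.

```c
/* Alternating recognition/segmentation inference, reduced to its
 * control flow. The nearest-center scoring is a placeholder for the
 * paper's shape model and boosting classifier. */
#include <stdbool.h>
#include <stddef.h>

typedef struct { double x, y, z; int label; } Point;   /* label: -1 = background */
typedef struct { double x, y, z; } Hypothesis;         /* candidate object center */

/* Recognition step (simplified): move each hypothesis to the centroid
 * of the points currently assigned to it. */
static void update_recognition(Hypothesis *h, size_t nh, const Point *p, size_t np)
{
    for (size_t j = 0; j < nh; j++) {
        double sx = 0, sy = 0, sz = 0; size_t cnt = 0;
        for (size_t i = 0; i < np; i++)
            if (p[i].label == (int)j) { sx += p[i].x; sy += p[i].y; sz += p[i].z; cnt++; }
        if (cnt) { h[j].x = sx / cnt; h[j].y = sy / cnt; h[j].z = sz / cnt; }
    }
}

/* Segmentation step (simplified): assign each point to the nearest
 * hypothesis within RADIUS, else background. Returns true if any
 * label changed. */
static bool update_segmentation(Point *p, size_t np, const Hypothesis *h, size_t nh)
{
    const double RADIUS = 3.0;  /* illustrative cutoff */
    bool changed = false;
    for (size_t i = 0; i < np; i++) {
        int best = -1; double best_d = RADIUS * RADIUS;
        for (size_t j = 0; j < nh; j++) {
            double dx = p[i].x - h[j].x, dy = p[i].y - h[j].y, dz = p[i].z - h[j].z;
            double d = dx*dx + dy*dy + dz*dz;
            if (d < best_d) { best_d = d; best = (int)j; }
        }
        if (p[i].label != best) { p[i].label = best; changed = true; }
    }
    return changed;
}

void infer(Point *p, size_t np, Hypothesis *h, size_t nh, int max_iters)
{
    for (int it = 0; it < max_iters; it++) {
        if (!update_segmentation(p, np, h, nh))
            break;                          /* converged: labels stable */
        update_recognition(h, nh, p, np);
    }
}
```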

    Achieving high central processing unit efficiency and low tail latency in datacenters

    Thesis: Ph.D., Massachusetts Institute of Technology, Department of Electrical Engineering and Computer Science, 2019. Cataloged from the PDF version of the thesis. Includes bibliographical references (pages 95-104). As datacenters have proliferated over the last couple of decades and datacenter applications have grown increasingly complex, two competing goals have emerged for networks and servers in datacenters. On the one hand, applications demand low latency, on the order of microseconds, in order to respond quickly to user requests. On the other hand, datacenter operators require high CPU efficiency in order to reduce operating costs. Unfortunately, today's systems do a poor job of providing low latency and high CPU efficiency simultaneously. This dissertation presents Shenango, a system that improves CPU efficiency while preserving or improving tail latency relative to the state of the art. Shenango establishes that systems today are unable to provide CPU efficiency and low latency simultaneously because they reallocate cores across applications too infrequently. It contributes an efficient algorithm for deciding when applications would benefit from additional cores, as well as mechanisms to reallocate cores at microsecond granularity. Shenango's fast core reallocations enable it to match the tail latency of state-of-the-art kernel-bypass network stacks while linearly trading latency-sensitive application throughput for batch application throughput as load varies over time. While Shenango enables high efficiency and low tail latency at endhosts, end-to-end application performance also depends on the behavior of the network. Thus, this dissertation also describes Chimera, a proposal to build on Shenango by co-designing congestion control with CPU scheduling, so that congestion control can optimize for end-to-end latency and efficiency. By Amy Ousterhout.
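
    The published Shenango design describes its detector as a watermark check: if any item queued at the previous 5 µs inspection is still pending, the application has failed to drain its queues within one interval and is deemed congested. The sketch below renders that idea with illustrative data structures (a sequence-numbered FIFO per application); it is a simplification, not the dissertation's code.

```c
/* Simplified congestion detection in the spirit of Shenango: an app is
 * congested if its oldest queued item predates the previous check.
 * Queue representation and names are illustrative. */
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct app_queues {
    uint64_t head;           /* sequence number of oldest queued item */
    uint64_t tail;           /* sequence number of next item to enqueue */
    uint64_t last_seen_tail; /* tail observed at the previous check */
};

/* True if the queue failed to drain within one interval. */
static bool is_congested(struct app_queues *q)
{
    bool stale = (q->tail > q->head) && (q->head < q->last_seen_tail);
    q->last_seen_tail = q->tail;
    return stale;
}

/* Core-allocation decision pass, run once per interval; grant/yield
 * callbacks stand in for the actual core-handoff mechanism. */
void reallocate_cores(struct app_queues *apps, size_t n,
                      void (*grant_core)(size_t app),
                      void (*yield_core)(size_t app))
{
    for (size_t i = 0; i < n; i++) {
        if (is_congested(&apps[i]))
            grant_core(i);                      /* add a core if available */
        else if (apps[i].tail == apps[i].head)
            yield_core(i);                      /* queues empty: reclaimable */
    }
}
```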

    Programmable data plane for resource management in datacenters

    Thesis: S.M., Massachusetts Institute of Technology, Department of Electrical Engineering and Computer Science, 2015. Cataloged from the PDF version of the thesis. Includes bibliographical references (pages 47-51). Network resource management schemes can significantly improve the performance of datacenter applications. However, it is difficult to experiment with and evaluate these schemes today because they require modifications to hardware routers. To address this, we introduce Flexplane, a programmable network data plane for datacenters. Flexplane enables users to express their schemes in a high-level language (C++) and then run real datacenter applications over them at hardware rates. We demonstrate that Flexplane can accurately reproduce the behavior of schemes already supported in hardware (e.g., RED, DCTCP) and can be used to experiment with schemes not yet supported in hardware, such as HULL. We also show that Flexplane is scalable and has the potential to support large networks. By Amy Ousterhout.
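
    To give a flavor of what expressing such a scheme in a few lines of high-level code looks like, here is textbook RED (Random Early Detection) as a standalone drop decision. Flexplane itself hosts C++ implementations; this C sketch uses illustrative parameters and omits RED's count-based probability correction.

```c
/* Textbook RED drop decision: track an EWMA of queue length and drop
 * arriving packets with probability rising from 0 at min_th to max_p
 * at max_th. Parameters are illustrative. */
#include <stdbool.h>
#include <stdlib.h>

struct red_state {
    double avg;      /* EWMA of queue length, in packets */
    double weight;   /* EWMA weight, e.g. 0.002 */
    double min_th;   /* below this average, never drop */
    double max_th;   /* above this average, always drop */
    double max_p;    /* drop probability at max_th */
};

/* Returns true if the arriving packet should be dropped. */
bool red_should_drop(struct red_state *s, unsigned queue_len)
{
    s->avg = (1.0 - s->weight) * s->avg + s->weight * queue_len;
    if (s->avg < s->min_th)
        return false;
    if (s->avg >= s->max_th)
        return true;
    double p = s->max_p * (s->avg - s->min_th) / (s->max_th - s->min_th);
    return (double)rand() / RAND_MAX < p;
}
```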

    Just in time delivery: Leveraging operating systems knowledge for better datacenter congestion control

    Network links and server CPUs are heavily contended resources in modern datacenters. To keep tail latencies low, datacenter operators drastically overprovision both types of resources today, and there has been significant research into effectively managing network traffic [4, 19, 21, 29] and CPU load [22, 27, 32]. However, this work typically looks at the two resources in isolation. In this paper, we make the observation that, in the datacenter, the allocation of network and CPU resources should be co-designed for the best efficiency and response times. For example, while congestion control protocols can prioritize traffic from certain flows, this provides no benefit if the traffic arrives at an overloaded server that will only queue the request. This paper explores the potential benefits of such a co-designed resource allocator and considers the recent work in both CPU scheduling and congestion control that is best suited to such a system. We propose Chimera, a new datacenter OS that integrates a receiver-based congestion control protocol with OS insight into application queues, building on the recent Shenango operating system [32].
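
    The co-design argument suggests a receiver that sizes sender credit by spare CPU capacity rather than buffer space alone: if the cores are saturated, more credit only grows a queue. The function below is one hypothetical rendering of that coupling, not a published Chimera mechanism; all names and constants are invented.

```c
/* Hypothetical receiver-side credit computation coupling congestion
 * control to CPU scheduling: advertise credit in proportion to the
 * processing capacity the scheduler can still devote to the app. */
#include <stdint.h>

#define MAX_CREDIT 256  /* cap on outstanding requests per sender */

/* queued: requests already waiting for CPU at this receiver.
 * idle_cores: cores the scheduler could still grant to the app.
 * per_core_rate: requests one core can absorb per credit interval. */
uint32_t advertised_credit(uint32_t queued, uint32_t idle_cores,
                           uint32_t per_core_rate)
{
    /* Requests the app can absorb next interval, assuming it keeps
     * one core and may receive the idle ones. */
    uint64_t can_absorb = (uint64_t)(idle_cores + 1) * per_core_rate;

    if (queued >= can_absorb)
        return 0;  /* backlog already exceeds capacity: pause senders */

    uint64_t credit = can_absorb - queued;
    return credit > MAX_CREDIT ? MAX_CREDIT : (uint32_t)credit;
}
```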

    Shenango: Achieving high CPU efficiency for latency-sensitive datacenter workloads

    Datacenter applications demand microsecond-scale tail latencies and high request rates from operating systems, and most applications handle loads that have high variance over multiple timescales. Achieving these goals in a CPU-efficient way is an open problem. Because of the high overheads of today's kernels, the best available solution for achieving microsecond-scale latencies is kernel-bypass networking, which dedicates CPU cores to applications for spin-polling the network card. But this approach wastes CPU: even at modest average loads, one must dedicate enough cores for the peak expected load. Shenango achieves comparable latencies at far greater CPU efficiency. It reallocates cores across applications at very fine granularity (every 5 µs), enabling cycles unused by latency-sensitive applications to be used productively by batch processing applications. It achieves such fast reallocation rates with (1) an efficient algorithm that detects when applications would benefit from more cores, and (2) a privileged component called the IOKernel that runs on a dedicated core, steering packets from the NIC and orchestrating core reallocations. When handling latency-sensitive applications such as memcached, we found that Shenango achieves tail latency and throughput comparable to ZygOS, a state-of-the-art kernel-bypass network stack, but can linearly trade latency-sensitive application throughput for batch processing application throughput, vastly increasing CPU efficiency.
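
    A skeletal view of the orchestration side the abstract describes: a dedicated core wakes every 5 µs, tests each application for congestion, and moves cores between an idle pool and the applications. The congestion test here is deliberately coarse (a more faithful sequence-number version appears under the dissertation entry above), and all bookkeeping is a stand-in for the IOKernel rather than Shenango's code.

```c
/* Simplified per-interval pass of an IOKernel-style orchestrator.
 * Packet steering between the NIC and app cores is omitted. */
#include <stdbool.h>
#include <stddef.h>

#define INTERVAL_NS 5000  /* 5 us reallocation granularity */

struct app {
    unsigned queued;       /* items currently waiting (packets + threads) */
    unsigned seen_queued;  /* queue depth observed at the previous check */
    unsigned cores;        /* cores currently granted to this app */
};

static unsigned idle_cores;  /* cores not assigned to any app */

static void check_app(struct app *a)
{
    /* Coarse heuristic: nonempty at two consecutive checks means the
     * queue did not drain within one interval. */
    bool congested = a->queued > 0 && a->seen_queued > 0;
    a->seen_queued = a->queued;

    if (congested && idle_cores > 0) {
        idle_cores--; a->cores++;       /* grant one more core */
    } else if (!congested && a->queued == 0 && a->cores > 1) {
        a->cores--; idle_cores++;       /* reclaim, keeping one core */
    }
}

/* Called every INTERVAL_NS by the dedicated orchestration core. */
void iokernel_tick(struct app *apps, size_t n)
{
    for (size_t i = 0; i < n; i++)
        check_app(&apps[i]);
}
```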

    Stateless CPU-aware datacenter load-balancing

    Today, datacenter operators deploy load balancers (LBs) to utilize server resources efficiently, but they must over-provision server resources (by up to 30%) because of load imbalances and the desire to bound tail service latency. We posit that one of the reasons for these imbalances is the lack of per-core load statistics in existing LBs. As a first step, we designed CrossRSS, a CPU core-aware LB that dynamically assigns incoming connections to the least loaded cores in the server pool. CrossRSS leverages knowledge of how each server's Network Interface Card (NIC) dispatches packets to specific cores to reduce imbalances by more than an order of magnitude compared to existing LBs in a proof-of-concept datacenter environment, processing 12% more packets with the same number of cores.
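
    A sketch of the selection this enables: because the LB knows each server's RSS indirection table (hash bucket to core), it can pick the (server, bucket) pair whose destination core is least loaded and steer the new connection into that bucket. The bucket-targeting step itself (finding header values whose RSS hash lands in the chosen bucket) is omitted, and table sizes and names are illustrative.

```c
/* Core-aware target selection: scan every server's indirection table
 * and pick the bucket whose destination core is least loaded. */
#include <stddef.h>
#include <stdint.h>

#define RETA_SIZE 128  /* RSS indirection table entries per server */

struct server {
    uint8_t  reta[RETA_SIZE];  /* indirection table: bucket -> core id */
    uint32_t core_load[64];    /* per-core load reported by the server */
};

void pick_target(const struct server *srv, size_t n,
                 size_t *out_server, unsigned *out_bucket)
{
    uint32_t best = UINT32_MAX;
    *out_server = 0;
    *out_bucket = 0;

    for (size_t s = 0; s < n; s++) {
        for (unsigned b = 0; b < RETA_SIZE; b++) {
            uint32_t load = srv[s].core_load[srv[s].reta[b]];
            if (load < best) {
                best = load;
                *out_server = s;
                *out_bucket = b;
            }
        }
    }
}
```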

    RSS++: load and state-aware receive side scaling

    While the current literature typically focuses on load-balancing among multiple servers, in this paper we demonstrate the importance of load-balancing within a single machine (potentially with hundreds of CPU cores). In this context, we propose a new load-balancing technique (RSS++) that dynamically modifies the receive-side scaling (RSS) indirection table to spread the load across the CPU cores more evenly. RSS++ incurs up to 14x lower 95th-percentile tail latency and orders of magnitude fewer packet drops than RSS under high CPU utilization. RSS++ allows higher CPU utilization and dynamic scaling of the number of allocated CPU cores to accommodate the input load, while avoiding the typical 25% over-provisioning. RSS++ has been implemented for both (i) DPDK and (ii) the Linux kernel. Additionally, we implement a new state migration technique that facilitates sharding and reduces contention between CPU cores accessing per-flow data. RSS++ keeps flow state in groups that can be migrated at once, leading to 20% higher efficiency than a state-of-the-art shared flow table.
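
    A simplified version of the rebalancing step this implies: attribute measured load to each indirection-table bucket, then greedily move buckets off overloaded cores. The policy below is illustrative, and it omits the per-bucket flow-state migration that RSS++ performs alongside the table update; on Linux, the recomputed table would be installed through the same kernel interface that `ethtool -X` uses.

```c
/* Greedy rebalancing of an RSS indirection table: move hash buckets
 * from cores above the average load to the least-loaded core, as long
 * as the move reduces the imbalance. Sizes are illustrative. */
#include <stdint.h>

#define RETA_SIZE 128
#define NCORES    16

void rebalance(uint8_t reta[RETA_SIZE], const uint32_t bucket_load[RETA_SIZE])
{
    uint64_t core_load[NCORES] = {0};
    uint64_t total = 0;

    for (int b = 0; b < RETA_SIZE; b++) {
        core_load[reta[b]] += bucket_load[b];
        total += bucket_load[b];
    }
    uint64_t target = total / NCORES;

    for (int b = 0; b < RETA_SIZE; b++) {
        uint8_t c = reta[b];
        if (core_load[c] <= target)
            continue;                       /* source core not overloaded */

        int least = 0;                      /* currently least-loaded core */
        for (int i = 1; i < NCORES; i++)
            if (core_load[i] < core_load[least])
                least = i;

        /* Move the bucket only if it actually reduces the imbalance. */
        if (core_load[least] + bucket_load[b] < core_load[c]) {
            reta[b] = (uint8_t)least;
            core_load[c]     -= bucket_load[b];
            core_load[least] += bucket_load[b];
        }
    }
}
```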